CRAG: de novo characterization of cell-free DNA fragmentation hotspots in plasma whole-genome sequencing

The fine-scale cell-free DNA fragmentation patterns in early-stage cancers are poorly understood. We developed a de novo approach to characterize the cell-free DNA fragmentation hotspots from plasma whole-genome sequencing. Hotspots are enriched in open chromatin regions, and, interestingly, 3′end of transposons. Hotspots showed global hypo-fragmentation in early-stage liver cancers and are associated with genes involved in the initiation of hepatocellular carcinoma and associated with cancer stem cells. The hotspots varied across multiple early-stage cancers and demonstrated high performance for the diagnosis and identification of tissue-of-origin in early-stage cancers. We further validated the performance with a small number of independent case–control-matched early-stage cancer samples. Supplementary Information The online version contains supplementary material available at 10.1186/s13073-022-01141-8.


Supplementary Methods
The saturation analysis of the fragment number required for CRAG.
A group of fragmentation-positive and fragmentation-negative regions was generated for the benchmark. For fragmentation-positive regions, we chose the CGI TSS that are overlapped with conserved TssA chromHMM states (15-state chromHMM) shared across the cell types from NIH Epigenome Roadmap. Regions that are -50bp to +150bp around these active TSS were defined as the fragmentation-positive regions. For fragmentation-negative regions, we chose the same number of random genomic regions from conserved Quies chromHMM states shared across the cell types but with the same chromosome, region size, G+C% content, and mappability score as that in fragmentation-positive regions.
We downsampled the high-quality fragments in the BH01 dataset to 50 million fragments. We identified the hotspots at these downsampled datasets and calculated TP (true positive), FP (false positive), TN (true negative), FN (false negative) based on their overlaps with the benchmark regions generated above. F-score was calculated: (1) in which, Precision and Recall were calculated using equation (2) and equation (3), respectively: The performance is saturated at ~0.9 F1-score with 200 million high-quality fragments. (Additional file 2: Fig. S1).
The enrichment analysis of the cfDNA fragmentation hotspots in gene-regulatory elements.
The number of hotspots that overlapped with the regulatory element was counted by bedtools v2 1 . After filtering out the dark regions and low mappability regions (mappability less than 0.9), random genomic regions were generated with matched chromosomes and sizes. Fisher exact test (two-tail) was performed to calculate the enrichment of hotspots over the matched random regions.
The Principal Component Analysis of the cfDNA fragmentation hotspots across different diseases.
The cfDNA fragmentation hotspots were called at each pathological condition as described in the Methods. Principal Component Analysis (PCA) was performed on the z-score transformed IFS across all the fragmentation hotspots (pca function at Matlab 2019b).
Unsupervised hierarchical clustering analysis of the cfDNA fragmentation hotspots across different diseases.
The cfDNA fragmentation hotspots were called at each pathological condition as described in the Methods. Top N most variable hotspots were kept for the clustering (ranked by the variation across all the samples). Spearman's rank correlation was utilized to evaluate the distance among the samples. Also, weighted average distance (WPGMA, with 'weighted' as the parameter in clustergram function at Matlab 2019b) was applied together with the linkage method. In the Cristiano et al. dataset, one-way ANOVA (p-value < = 0.01) was applied to select the hotspots that showed the group-specific fragmentation patterns. Further, hotspots are ranked by the z-score difference between the samples within the group and outside the group. The top 5,000 hotspots in each group were finally visualized in the figure.
The t-SNE visualization of the cfDNA fragmentation hotspots across different diseases.
T-SNE (tsne function at Matlab 2019b) was utilized for the dimensionality reduction and visualization of the fragmentation dynamics in the hotspots across multiple cancer and healthy conditions. Hotspots with a p-value <= 0.01 (one-way ANOVA) were used for the analysis. Distance similarity was calculated by the Spearman correlation together with default parameters (tsne function at Matlab 2019b).

The Gene Ontology and pathway analysis of the cfDNA fragmentation hotspots.
Gene Ontology (GO) and KEGG pathway analysis of the cfDNA fragmentation hotspots was performed by Cistrome-GO 2 .
The motif analysis of the cfDNA fragmentation hotspots.
Motif analysis of cfDNA fragmentation hotspots was performed by HOMER (v4.11) with the command 'findMotifsGenome.pl hotspots_file hg19 output_file -size given' 3 . Only motifs with a q-value of less than 0.01 were kept.

The batch effect correction and the performance evaluation at the independent test dataset.
Principal Component Analysis (PCA) was performed on the z-score transformed IFS across all the fragmentation hotspots in public dataset and independent test dataset. There is indeed a batch effect between the training set (public data) and our independent test set (Additional file 2: Fig. S19a). Therefore, we performed a batch effect correction. Specifically, we used num.sv and sva functions in sva package (R v4.2.0) to estimate the surrogate variables. num.sv suggests that there is only one major hidden surrogate variable. The underlying surrogate variable showed a high correlation with the data source and confirmed that the data source is indeed the only major contributor to the batch effect we observed (we also infer the gender by coverage ratio in chrX/chrY from the data). Finally, we apply the ComBat algorithm to correct the batch effects between the train and test datasets by their data source. After the correction, we randomly choose 2/3 of the samples in the training dataset (public data) to train the model (SVM classifier with linear kernel and default parameters). To balance the case and control categories, we down-sampled them to be equal during the training process. We evaluated the performance at the remaining 1/3 of the training set and chose the fixed cut-off to identify the positive and negative labels in the remaining 1/3 of the training set. We further applied this model and fixed the cut-off at the independent test set to vote for the positive and negative labels for the test samples. We repeated this process 100 times. Samples that received more than half of the positive votes (>=50 votes) will be classified as cancer, otherwise controls. The performance is shown in Additional file 1: Table S11. The high performance in the test set suggests that our method indeed has the potential clinical application.        The Odds ratio is compared between lung-cancer-specific (Class II) hotspots over healthyspecific (Class I) hotspots. P-value is calculated based on Fisher's exact test.                Table. S4. Performance evaluation for the detection of early-stage HCC (before GC bias correction). Table. S5. Performance evaluation for the detection of early-stage HCC (after GC bias correction). Table. S6. Performance evaluation to distinguish early-stage HCC with benign conditions (HBV-associated liver cirrhosis and chronic HBV infection). Table. S7. Performance evaluation to distinguish early-stage HCC with benign conditions (HBV-associated liver cirrhosis and chronic HBV infection) (after GC bias correction). Table. S8. Performance evaluation for the detection of multiple cancer types (after GC bias correction). Table. S9. Performance evaluation for the detection of multiple cancer types (before GC bias correction).

Supplementary Figures
Table. S10. Performance evaluation at the independent validation sets (after GC bias correction).
Table. S11. Performance evaluation with a fixed cut-off at both training and test set (after batch effect correction). Table. S12. Performance evaluation for the localization of five or six cancer types.
Data file S2. The collections of cfDNA fragmentation files and their hotspots identified at different conditions in the study. Available in Zenodo.org (https://doi.org/10.5281/zenodo.6914806).